feat: Gemma 4 day-0 support + token-level OutputRouter (v0.4.0) by raullenchai · Pull Request #58 · raullenchai/Rapid-MLX

raullenchai · 2026-03-26T20:58:26Z

Summary

Day-0 Gemma 4 full lineup + upstream sync + new OutputRouter architecture.

Depends on: #67 (upstream sync — merge first)

Gemma 4 benchmarks (M3 Ultra 256GB):

Model	Decode	TTFT cached	Tools	Leak	RAM
26B-A4B MoE 4bit	94 tok/s	252ms	100%	0%	14.4 GB
E4B 4bit	83 tok/s	253ms	100%	0%	6.4 GB
31B dense 4bit	31 tok/s	339ms	100%	0%	17.0 GB
31B bf16	10.9 tok/s	574ms	100%	0%	58.1 GB

New: OutputRouter — token-level routing, zero regex, 0% thinking leak.
18 tool parsers including Gemma 4 native format.
Agent-ready: OpenCode, Aider, LangChain, Cursor verified.

🤖 Generated with Claude Code

Replace per-token tokenizer.decode([token]) with a streaming detokenizer that buffers partial UTF-8 byte sequences. This fixes corrupted multi-byte characters (e.g. Czech 'ď' → '��') during SSE streaming, caused by byte-level tokens being decoded individually instead of accumulated until a complete UTF-8 character boundary. Uses mlx_lm's NaiveStreamingDetokenizer (or the optimized BPEStreamingDetokenizer when available via tokenizer.detokenizer) with a per-request pool that is cleaned up on request completion. Both LLM scheduler and MLLM scheduler are fixed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Allow users to serve a model under a different name in API responses, matching vLLM's --served-model-name behavior.

The cache directory was derived from _model_name which could be overridden by --served-model-name, causing cache misses when the served name changed. Use the actual model path instead.

…ol streaming - Add strict=False fallback in tokenizer loader for models with extra weights (e.g., vision tower params), enabling Qwen3.5 to load via mlx-lm as a text-only model - Fix streaming tool call parsing when both --reasoning-parser and --tool-call-parser are enabled (previously mutually exclusive branches) - Make memory pressure threshold dynamic based on system RAM instead of hardcoded 200GB Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fixes AttributeError when ArraysCache.is_trimmable() returns True but the trim() method doesn't exist. Added hasattr check for trim before calling it in scheduler.py lines 772 and 802. Closes #145

…zation

…odels Qwen3.5 uses a hybrid architecture (Attention + Mamba/SSM layers), where `model.make_cache()` returns a mix of `KVCache` and `ArraysCache` objects. `ArraysCache.__init__()` requires a `size` parameter, but `BatchMambaCache` conditionally skipped it when `HAS_MAMBA_CACHE=True`. Since `MambaCache` was removed in mlx-lm >= 0.30.6 and falls back to `ArraysCache`, the `HAS_MAMBA_CACHE` flag is unreliable. This caused `--continuous-batching` mode to crash in an infinite error loop: `ArraysCache.__init__() missing 1 required positional argument: 'size'` The fix unconditionally passes `size` to `super().__init__()`, which is safe for both `ArraysCache` (requires it) and legacy `MambaCache` (accepts it). Without this fix, continuous batching and prefix caching are completely broken for Qwen3.5 models on Apple Silicon. Related upstream issues: - ml-explore/mlx-lm#980 (prefix cache fails for hybrid models) - QwenLM/Qwen3.6#37 (ArraysCache vs KVCache in hybrid arch)

mlx-lm 0.31.0 added prompt_checkpoints support, changing the BatchGenerator.insert() tuple from 6 elements to 7. This causes "ValueError: too many values to unpack (expected 6)" in _chunked_next when processing any request. Changes: - scheduler.py line ~395: unpack 7 values (add _prompt_checkpoints) - scheduler.py line ~406: pass max_kv_size=None to _make_cache() (signature changed in mlx-lm 0.31.0 to require 3 args) Tested on Mac Mini M4 Pro 64GB with: - mlx-lm 0.31.0 - mlx 0.31.1 - Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit - vllm-mlx 0.2.5 (this fork) Fixes the same issue as jundot/omlx#110.

Three bugs fixed: 1. video_url content type silently ignored in MLLM chat() and stream_chat(). The OpenAI API video format uses {"type": "video_url", "video_url": {"url": ...}} but only "video" type was handled. Fixes #120. 2. Video frames extracted AFTER chat template built, causing token count mismatch (template has 0 image tokens but vision encoder produces N*frame features). Restructured to two-pass approach: extract video frames first, then build chat template with correct frame counts. 3. server.py has_media always False in MLLM mode because images/videos are extracted from messages internally (set to []). Added MLLM-specific check so video_fps/video_max_frames params still reach chat() via chat_kwargs.

For models with video_token_id (Qwen-family), video inputs now flow through mlx-vlm's native video pipeline instead of being treated as individual images. This activates: - 3D conv frame pairing (temporal_patch_size=2) - M-RoPE temporal position IDs (interleaved layout) - Timestamp-frame interleaving in the prompt - Proper video_grid_thw for the vision encoder Falls back to frame-as-images for non-video models. Adds _generate_native_video() and _translate_messages_for_native_video() to MLXMultimodalLM, plus unit tests for video URL parsing, frame count alignment, and message translation.

…tion

… (#180) * feat: MLLM+MTP per-request routing for text and vision When both --mllm and --enable-mtp are set, SimpleEngine builds a parallel mlx_lm TextModel sharing the VLM backbone weights (zero-copy). Text-only requests route to mlx_lm with MTP speculative decoding; media requests route to the mlx_vlm MLLM path. Key components: - text_model_from_vlm.py: Build mlx_lm TextModel from VLM weights - Per-request routing in stream_chat() via _has_media_content() - _stream_generate_text() for MTP-accelerated text generation - MTP passthrough: --enable-mtp flag through CLI/server/engine/LLM Tested on Qwen3.5-35B-A3B VLM+MTP (8-bit): - Text (MTP): 65.3 tok/s - Vision (MLLM): 63.8 tok/s - Memory: 38.7 GB (zero-copy, same as single model) * feat: system prompt KV caching for SimpleEngine MTP text path Persist backbone KV cache after prefilling system prompt tokens. On subsequent requests with the same system prompt, restore the snapshot and only prefill the suffix (user + history) tokens. For a 10K-token system prompt on the 122B model, this saves ~57s per request by avoiding redundant system prompt prefill. Implementation: - Detect system prefix via ChatML boundary markers - Hash prefix text for cache key validation - On cache miss: prefill system tokens, snapshot backbone KV state - On cache hit: restore snapshot into fresh cache, send suffix only - Token prefix validation ensures correct split at tokenization boundary - Single-entry cache (one system prompt at a time) - Stats exposed via get_stats() → system_kv_cache - Cache cleared on stop(), invalidated on system prompt change * feat: SpecPrefill — attention-based sparse prefill for TTFT reduction Uses a small draft model to identify important prompt tokens via attention scoring, then sparse-prefills the target model with only those tokens while preserving original positional encoding via manual RoPE. Reduces TTFT 2.8-3.1x on 122B and 1.8x on 35B at 20% keep rate. Implementation: - specprefill.py: Core module with score_tokens(), select_chunks(), sparse_prefill(), cleanup_rope() (~640 lines) - SimpleEngine integration: draft model loading, threshold-based activation, composition with system prompt KV cache, graceful fallback on error - Per-request API: specprefill (bool) + specprefill_keep_pct (float) via extra_body for per-request control - CLI: --specprefill, --specprefill-threshold, --specprefill-keep-pct, --specprefill-draft-model, --prefill-step-size Closes #179. Related: #178 (TTFT), #57 (speculative decoding). * feat: multi-architecture support for SpecPrefill scoring and sparse prefill Add support for three model architecture families with auto-detection: - Qwen3.5: gate split + q_norm + RoPE (existing, now refactored) - Nemotron-H: content-based attention (no RoPE), mixer attr, compacted cache - GPT-OSS/Llama: standard q_proj + RoPE (GQA, YarnRoPE compatible) Key changes: - Architecture-specific query extractors (_qwen35, _llama, _nemotron_h) - Auto-detection in score_tokens() via model attributes (q_norm/rope/mixer) - _get_attn_module()/_set_attn_module() abstract self_attn vs mixer access - _find_attention_layers() handles block_type="*" (Nemotron-H attention) - _build_layer_to_cache_map() handles compacted cache indexing - sparse_prefill() skips RoPE patching for architectures without it - cleanup_rope() is no-op for RoPE-less architectures - Remove score_tokens_self() stub (CritiPrefill not viable for MoE) Tested on Qwen3.5 4B (positions + pipeline). Nemotron-H and GPT-OSS code paths ready for empirical validation. * fix: handle GPT-OSS sliding window caches and head attribute naming Two bugs found during cross-architecture testing on GPT-OSS 120B: 1. _llama_extract_queries() used eager evaluation in getattr fallback chain: getattr(attn, "num_attention_heads", attn.num_heads) evaluates attn.num_heads before checking if num_attention_heads exists. Fixed to use safe nested getattr with None default. 2. _compute_importance() concatenated score matrices with different shapes when mixing sliding window (128-token RotatingKVCache) and full attention (unlimited KVCache) layers. Fixed by skipping layers whose cache spans fewer tokens than the full prompt. Validated on GPT-OSS 120B + 20B draft: importance-based selection produces coherent output while uniform selection degrades, confirming scoring signal from 18 full-attention layers is sufficient. * fix: preserve tail tokens for models with RotatingKVCache Models with sliding window attention (e.g., GPT-OSS alternating sliding/full layers) use RotatingKVCache that evicts old entries. When sparse prefill inserts more tokens than the window size, the cache loses context needed for decode. sparse_prefill() now auto-detects RotatingKVCache and augments the selection to include the last max_size positions, ensuring sliding window layers have valid recent context. Validated: GPT-OSS 120B + 20B draft produces coherent output on 2294-token prompts (was garbage before this fix). Qwen3.5 and Nemotron-H unaffected (no RotatingKVCache in their cache). * feat: SpecPrefill support for non-MTP models (standard LLM path) Add _stream_generate_specprefill() method for models that don't use MTP speculative decoding (Nemotron, GPT-OSS, etc). The existing SpecPrefill integration only worked in the MTP text path (_stream_generate_text). Changes: - stream_generate() now pops specprefill/specprefill_keep_pct from kwargs and dispatches to the new method when conditions are met - _stream_generate_specprefill() follows the same pattern as the MTP path: score → select → sparse_prefill → autoregressive generation - Graceful fallback to normal generation on any error - Per-request overrides (specprefill, specprefill_keep_pct) via extra_body - Threshold and upper-bound checks identical to MTP path

…strict=False loader

Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming

…enerate wiring - Forward tools to apply_chat_template in native video path (fixes silent tool-call drop, regression from PR #124) - Pop tools, use_cache, video_fps, video_max_frames from kwargs before native video branch in chat() and stream_chat() to prevent leaking into mlx_vlm.generate() - Extract _collect_video_inputs() to deduplicate video extraction between chat() and stream_chat() - Split _generate_native_video into _prepare_native_video_inputs (preprocessing) + _generate_native_video (generation) wired through mlx_vlm.video_generate for clearer intent and easier adoption of upstream improvements - Add ImportError guard on video_generate import in _generate_native_video to match codebase convention - Document blocking stream_chat native video path — no upstream streaming API, engine wraps in asyncio.to_thread() - Add tests for multi-message videos, multiple videos per message, video_url translation, Pydantic handling, tool forwarding, video_generate import verification

Add --served-model-name CLI parameter

…injection - ensure_mamba_support() now no-op: mlx-lm >= 0.30.6 ArraysCache has native batch support; old patch broke hybrid models (ArraysCache + KVCache) - Add inject_mtp_support(): dynamically create MTP module, load weights, and monkey-patch model class with return_hidden/mtp_forward/make_mtp_cache - Add _try_inject_mtp_post_load: auto-detect and inject MTP weights stripped by sanitize() during mlx_lm.load() - Add strict=False fallback for models with extra MTP parameters - validate_mtp_support: support model.language_model.args hierarchy - Improve engine loop error logging with full traceback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: native Qwen3-VL video support in MLLM mode

…injection fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection

Truncating the string causes similar-but-not-the-same base64 JPGs to return the same hash, causing vllm-mlx to use the same cached image for all of them, resulting in duplicated and incorrect responses.

Benchmarked 14 models post-merge on Mac Studio M3 Ultra (256GB): Qwen family (100% tools): - Qwen3.5-4B 4bit: 161.5 tok/s, 2.9 GB - Qwen3.5-9B 4bit: 99.8 tok/s, 5.4 GB - Qwen3.5-27B 4bit: 39.0 tok/s, 14.8 GB - Qwen3.5-35B-A3B 8b: 83.1 tok/s, 35.0 GB - Qwen3-Coder-Next 4b: 74.5 tok/s, 42.4 GB - MiniMax-M2.5 4bit: 51.7 tok/s, 120.4 GB Non-Qwen: - Llama-3.2-3B: 226.5 tok/s (fastest, no tools) - Hermes-3-8B: 123.4 tok/s (no tools) - Phi-4-mini: 174.0 tok/s (no tools, 100% leak) - Gemma-3-12B: 48.4 tok/s (no tools) - Mistral-Small: pending (see json) - Devstral-24B: 29.6 tok/s (no tools) - GPT-OSS-20B: 58.5 tok/s (no tools) - GLM-4.5-Air: ~49 tok/s (100% tools) Agent integration verified: - LangChain: basic chat + tool calling ✓ - OpenAI SDK: streaming + tool calling ✓ - Aider-style: code editing + multi-turn ✓ - OpenCode-style: streaming tool calls ✓ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mlx-vlm 0.4.3 has two bugs loading Gemma 4 models: 1. sanitize() doubles the 'language_model.model.' prefix 2. MLX-format models skip sanitize entirely Our patch intercepts load_model(), fixes the prefix mapping, and force-reloads weights when scales are detected as all-zero. Linear layer weights load correctly with this patch. Embedding layer remains broken due to upstream quantization issue (scales are zero in the safetensors files themselves — mlx-community model bug). Upstream: Blaizzy/mlx-vlm#912 TODO: Remove patch once mlx-vlm fixes the bug. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mlx-vlm 0.4.3 supports Gemma 4 natively with bf16 weights (google/gemma-4-31b-it). The quantized model embedding issue is an upstream mlx-community quantization bug, not mlx-vlm's. Use bf16 original weights instead of patching around broken quants. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The eager template validation in streaming chat called build_prompt() which throws RuntimeError for MLLM models. Skip the check when engine.is_mllm is true. Fixes Gemma 4 (and all MLLM) streaming 500 errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Some VLM models (e.g. Gemma 4) raise concatenate errors during generator cleanup after generation completes. If we already have output tokens, log the warning and treat as finished instead of crashing the response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gemma 4 uses a native tool format: <|tool_call>call:name{k:<|"|>v<|"|>}<tool_call|> This parser handles both non-streaming and streaming extraction. Auto-detected for gemma4 model names. Registered as "gemma4" / "gemma_4". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ience - Global FastAPI exception handler catches unhandled errors and returns JSON 500 instead of killing the connection. Server stays alive. - MLLM chat() path now logs full traceback on failure before re-raising. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Load Gemma 4 via mlx-vlm's LanguageModel but route through the LLM path (not MLLM), enabling prompt cache, KV trim, and all decode optimizations. Gemma4TextWrapper adapts LanguageModelOutput → raw logits for mlx-lm generate_step() compatibility. Cache is fully trimmable (60 KVCache + RotatingKVCache layers). Auto-detected: gemma4 models skip --mllm and go through LLM path. Requires bf16 weights (quantized models have embedding issues). TODO: Remove once mlx-lm adds native gemma4 support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gemma 4 31B benchmark results (Mac Studio M3 Ultra 256GB): LLM path (4bit, with prompt cache): - Decode: 32 tok/s - TTFT cached: 242ms - Tool calling: 100% - RAM: 17 GB MLLM path (bf16, no cache): - Decode: 6.1 tok/s - TTFT: 874ms (no cache) - RAM: 59 GB LLM path advantage: 5.2x decode, 3.6x TTFT, 3.5x less RAM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

mlx_vlm.convert produces mixed-quant models (4bit default, 8bit MLP). Parse per-layer overrides from config and pass as class_predicate to nn.quantize(). Also fix tool parser super().reset() call. Tested: 31B 4bit (uniform), E4B 4bit (mixed) — both work. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add <|channel>thought...<channel|> to THINK_PATTERN for non-streaming - Add <|turn>, <turn|> to special token filter - Non-streaming Gemma 4 output now clean (thinking stripped) - Streaming still leaks thinking (needs gemma4 reasoning parser — TODO) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix no-op ternary in override key processing - Use path.endswith(suffix) instead of substring match to prevent false positives (e.g., layers.0 matching layers.10) - Filter override config to only bits/group_size/mode keys Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Day-0 Gemma 4 support on Rapid-MLX: - Gemma 4 26B-A4B MoE: 71 tok/s, 100% tools, 16 GB RAM - Gemma 4 31B dense: 32 tok/s, 100% tools, 17 GB RAM - 5.2x faster than mlx-vlm (LLM path with prompt cache) - Custom gemma4 tool call parser (18th parser format) - TTFT: 0.24s cached (vs 0.87s mlx-vlm) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Custom parser for Gemma 4's channel-based thinking format: <|channel>thought\n...reasoning...<channel|> <|channel>content\n...answer...<channel|> Streaming: thinking goes to 'reasoning' field, answer to 'content'. Non-streaming: strip_thinking_tags removes thought blocks. No more thinking leakage in client output. Closes #62. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New architecture for separating thinking/content/tool_calls: - Token-ID based state machine (no regex, no text matching) - Config-driven: reads special token IDs from tokenizer vocabulary - Auto-detects Gemma 4 format from tokenizer vocab - Single unified interface for both streaming and non-streaming - 17 unit tests covering all routing scenarios Design: OutputRouter.from_tokenizer(tokenizer) → state machine that routes each token to CONTENT/REASONING/TOOL_CALL/CONTROL channels. Currently implements Gemma 4 (channel tokens + tool_call tokens). Future: migrate Qwen3 (<think>), DeepSeek, etc. to same architecture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Wire token-level OutputRouter into the full streaming path: - MLXLanguageModel.stream_generate() routes each token via router - StreamingOutput gains 'channel' field ("content"/"reasoning"/"tool_call") - GenerationOutput passes channel through SimpleEngine to server - Server's stream_chat_completion uses channel for direct routing, bypassing regex-based reasoning parser for router-enabled models Gemma 4 streaming now uses token-level routing: - Zero thinking leakage (verified: 4/4 integration tests) - Content/reasoning cleanly separated - Tool calls properly accumulated and emitted - Old regex parsers remain as fallback for non-router models Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

From 3-round review: - Init _output_router=None in __init__ (AttributeError risk) - Move token_count++ after router suppress check (inflated count) - Use vocab.get() in from_tokenizer (KeyError on partial vocab) - Lazy decode: skip tokenizer.decode() for suppressed control tokens - Add try/except around router.feed() with fallback to decoder - Remove dead control_ids property Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Final Gemma 4 benchmark results (Mac Studio M3 Ultra 256GB): | Model | Decode | TTFT cached | Tools | RAM | |---------------------|-----------|-------------|-------|---------| | Gemma 4 26B-A4B 4b | 93.5 t/s | 252ms | 100% | 14.4 GB | | Gemma 4 E4B 4bit | 82.8 t/s | 253ms | 100% | 6.4 GB | | Gemma 4 31B 4bit | 30.9 t/s | 339ms | 100% | 17.0 GB | | Gemma 4 31B bf16 | 10.9 t/s | 574ms | 100% | 58.1 GB | All models: 100% tool calling, 100% recovery, 0% leak. Agent integration: 9/10 (OpenAI, LangChain, OpenCode, Aider). Token-level OutputRouter: zero thinking leakage in streaming. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major release highlights: - Day-0 Gemma 4 full lineup: E4B (83 tok/s), 26B-A4B (94 tok/s), 31B (31 tok/s) - Token-level OutputRouter: config-driven channel routing, zero regex - Gemma 4 tool call parser (18th format) + reasoning parser - 5.2x faster than mlx-vlm MLLM path (LLM path + prompt cache) - 100% tool calling, 0% thinking leakage across all Gemma 4 models - Upstream sync: 43 commits from waybarrios/vllm-mlx (SpecPrefill, streaming filters, Anthropic think blocks, detokenizer, v0.2.7) - Global exception handler for production resilience - mlx-vlm >= 0.4.4, mlx-lm >= 0.31.0 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gemma 4 models can emit <tool_call|>, <tool_response|>, <|tool>, <tool|> without matching opening tags during multi-round degradation. These leaked into client output. Now suppressed at token level: - Orphan <tool_call|> outside TOOL_CALL state - <|tool_response>, <tool_response|>, <|tool>, <tool|> always suppressed Added 4 tests for orphan token handling. Total: 21 router tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Gemma 4 models degrade to [Calling tool: name({...})] format after multiple tool rounds at low quantization. The gemma4 tool parser now catches this pattern and converts it to structured tool_calls. Also triggers tool markup detection on '[' character (not just '<') so the streaming tool parser path activates for text-format calls. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Defense-in-depth: sanitize_output() runs on every content delta before reaching the client. Catches ANY remaining markup: - <|..> and <..|> asymmetric tokens (Gemma 4) - <|..|> symmetric tokens (Qwen, GPT-OSS) - [Calling tool:...] text-format degradation - Stray </think>, </tool_call> closing tags Applied in _fast_sse_chunk (hot path) and Pydantic chunk path. Better to over-strip than to leak. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. Third streaming path missing '[' check in tool_markup_possible 2. Third streaming path Pydantic fallback missing sanitize_output() 3. Text-format tool call recovery had no deduplication (re-emitted same tool call on every subsequent delta) Also: guard empty SSE chunks from _fast_sse_chunk when sanitizer strips all content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Post-sanitizer benchmark + agent integration (M3 Ultra 256GB): | Model | Decode | Tools | Leak | Agent | |---------------------|----------|-------|------|-------| | 26B-A4B MoE 4bit | 93.5 t/s | 100% | 0% | 6/6 | | 31B dense 4bit | 31.0 t/s | 100% | 0% | 6/6 | | E4B 4bit | 82.2 t/s | 100% | 0% | 6/6 | Agent tests: OpenAI SDK, streaming, tools, streaming tools, LangChain bind_tools, multi-turn coding. All passed. 1989 unit tests + 21 OutputRouter tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…nc-march26 # Conflicts: # pyproject.toml

janhilgard and others added 30 commits February 24, 2026 13:47

Add --served-model-name CLI parameter

85bae64

Allow users to serve a model under a different name in API responses, matching vLLM's --served-model-name behavior.

Fix prefix cache dir using served name instead of model path

41b4e76

The cache directory was derived from _model_name which could be overridden by --served-model-name, causing cache misses when the served name changed. Use the actual model path instead.

fix: check trim method existence before calling

e765db8

Fixes AttributeError when ArraysCache.is_trimmable() returns True but the trim() method doesn't exist. Added hasattr check for trim before calling it in scheduler.py lines 772 and 802. Closes #145

fix(batched): add exclude_none=True to model_dump in image extraction

a445b23

fix: filter None values from dict() fallback and api/utils.py seriali…

295d690

…zation

fix(mllm_scheduler): add adaptive periodic cache clearing (#157)

80c6849

fix: rename platform.py to vllm_platform.py to avoid stdlib shadowing

b353aab

style: ruff format + lint fixes for new code

eb56c7d

Fix video native init, import guard, empty source and has_media detec…

92b3556

…tion

remove streaming tool fix (covered by #148) and fix eos_token_ids in …

d90486e

…strict=False loader

Add Qwen3.5 text-only loading and dynamic memory threshold (#127)

90eac21

Add Qwen3.5 model support (text-only) and fix reasoning+tool streaming

fix lint CI to use python 3.13 for black compatibility

913bfd0

format engine_core.py long line

0b07872

resolve merge conflicts with main

6e413f6

Merge pull request #125 from otarkhan/feature/served-model-name

c609b59

Add --served-model-name CLI parameter

resolve merge conflicts with main

35c77ec

format test_video.py

ede4e30

Merge pull request #150 from patanet7/feat/native-video-support

2a79216

feat: native Qwen3-VL video support in MLLM mode

remove dead code in _load_strict_false

74c2f02

Merge pull request #97 from janhilgard/fix/hybrid-model-batching-mtp-…

d235c37

…injection fix: Disable MambaCache monkey-patch for hybrid models, add MTP auto-injection

Don’t truncate base64 images before hashing.

8dd33e7

Truncating the string causes similar-but-not-the-same base64 JPGs to return the same hash, causing vllm-mlx to use the same cached image for all of them, resulting in duplicated and incorrect responses.

Your Name and others added 10 commits April 3, 2026 15:15

deps: bump mlx-vlm minimum to 0.4.4 for Gemma 4 support

2ecb9a0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

raullenchai force-pushed the feat/upstream-sync-march26 branch from 22b93aa to c1a1713 Compare April 5, 2026 10:44

Your Name and others added 10 commits April 5, 2026 04:13

raullenchai changed the title ~~Sync upstream: SpecPrefill, native video, MTP injection~~ feat: Gemma 4 day-0 support + token-level OutputRouter (v0.4.0) Apr 6, 2026

Your Name and others added 6 commits April 6, 2026 03:14

Merge remote-tracking branch 'raullenchai/main' into feat/upstream-sy…

6cb0f38

…nc-march26 # Conflicts: # pyproject.toml

raullenchai merged commit e642ce1 into main Apr 6, 2026
5 of 6 checks passed

raullenchai mentioned this pull request Apr 7, 2026

feat(gemma4): add Gemma 4 tool call parser #68

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Gemma 4 day-0 support + token-level OutputRouter (v0.4.0)#58

feat: Gemma 4 day-0 support + token-level OutputRouter (v0.4.0)#58
raullenchai merged 94 commits intomainfrom
feat/upstream-sync-march26

raullenchai commented Mar 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

Conversation

raullenchai commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

13 participants

raullenchai commented Mar 26, 2026 •

edited

Loading